XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents

Authors

Donato Malerba

Michelangelo Ceci

Margherita Berardi

Abstract

Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose a combination of both XML and knowledge technologies can be used. XML distinguishes between data, its structure and semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationship. Moreover, the combination with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, an effective transformation of paper documents into XML format is a complex process involving several steps. In this paper we propose the application of knowledge technologies to many document processing steps, namely rule-based systems for semantic indexing of documents and the extraction of the necessary knowledge by means of machine learning techniques. This approach has been implemented in the system Wisdom++, which is currently used in the European project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material) to provide film archivists with a tool for the automated annotation of historical documents in film archives.
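
As a rough illustration of the idea in the abstract, the sketch below applies a few hand-written rules to layout blocks and emits XML in which each element carries a semantic label next to its layout attributes. It is not the Wisdom++ implementation: the block fields, rule conditions, label names, and element names are all hypothetical, and in the approach described the rules would be extracted by machine learning rather than written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout blocks, as a page-segmentation step might produce them.
# Field names (x, y, height, text) are assumptions for illustration only.
BLOCKS = [
    {"x": 60, "y": 40, "height": 30, "text": "XML and Knowledge Technologies"},
    {"x": 60, "y": 90, "height": 12, "text": "D. Malerba, M. Ceci, M. Berardi"},
    {"x": 60, "y": 140, "height": 200, "text": "Effective daily processing ..."},
]

# Hand-written rules standing in for learned ones: (label, predicate) pairs
# tried in order; the first predicate that holds assigns the label.
RULES = [
    ("title", lambda b: b["y"] < 60 and b["height"] >= 20),
    ("author", lambda b: b["y"] < 120 and b["height"] < 20),
]


def label_block(block):
    """Return the semantic label of the first matching rule, else 'body'."""
    for label, predicate in RULES:
        if predicate(block):
            return label
    return "body"


def blocks_to_xml(blocks):
    """Build an XML tree that keeps layout attributes beside semantic labels."""
    root = ET.Element("document")
    for block in blocks:
        element = ET.SubElement(
            root,
            label_block(block),
            x=str(block["x"]),
            y=str(block["y"]),
            height=str(block["height"]),
        )
        element.text = block["text"]
    return root


if __name__ == "__main__":
    print(ET.tostring(blocks_to_xml(BLOCKS), encoding="unicode"))
```

Keeping the layout coordinates as attributes is one way an XSLT stylesheet could later reproduce the page layout, while the element names carry the semantic labels used for indexing.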

Similar resources

Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity

Given the growing number of XML documents, organizing them effectively so that useful information can be retrieved is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...

Retrieving Video Segments Based on Combined Text, Speech and Image Processing

This paper describes a multimedia, multilingual and multimodal research system (CIMWOS) supporting content-based indexing, archiving, retrieval and on-demand delivery of audiovisual content. Several projects aim to develop advanced technologies and systems to tackle the problems encountered in multimedia archiving and indexing [8], [9], [10]. CIMWOS [1] (Combined IMage and WOrd ...

Embedding Knowledge in Web Documents: CGs versus XML-based Metadata Languages

The paper argues for the use of general and intuitive knowledge representation languages for indexing the content of Web documents and representing knowledge within them. We believe these languages have advantages over metadata languages based on the Extensible Markup Language (XML). Indeed, the representation and retrieval of precise information is better supported by languages designed to rep...

A Bayesian Approach to WSD for the Retrieval of XML Documents

Sources of XML documents are proliferating on the World Wide Web today. An important feature of XML is that information on document structure is available on the Web together with the document contents. This information can be exploited to improve document handling and query processing. In such a heterogeneous environment as the Web, it is not reasonable to assume that there are ...

Mapping XML to Existing OWL Ontologies

Nowadays, XML has gained wide recognition and brought interoperability at a syntactic level. Unfortunately, even when XML is used to represent data, problems arise when different data sources must be integrated, because XML lacks support for efficient sharing of conceptualizations. Emerging Semantic Web technologies, such as ontologies, can enable semantic interoperability. With onto...

Publication date: 2003
